NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

GlyphPattern: An Abstract Pattern Recognition Benchmark for Vision-Language Models

Wu, Zixuan; Kim, Yoolim; Anderson, Carolyn Jane (July 2025, Association for Computational Linguistics)
Che, Wanxiang; Nabende, Joyce; Shutova, Ekaterina; Pilehvar, Mohammad Taher (Ed.)
Vision-Language Models (VLMs) have made rapid progress in reasoning across visual and textual data. While VLMs perform well on vision tasks that they are trained on, our results highlight key challenges in abstract pattern recognition. We present GlyphPattern, a 954 item dataset that pairs 318 human-written descriptions of visual patterns from 40 writing systems with three visual presentation styles.GlyphPattern evaluates abstract pattern recognition in VLMs, requiring models to understand and judge natural language descriptions of visual patterns. GlyphPattern patterns are drawn from a large-scale cognitive science investigation of human writing systems; as a result, they are rich in spatial reference and compositionality. Our experiments show that GlyphPattern is challenging for state-of-the-art VLMs (GPT-4o achieves only 55% accuracy), with marginal gains from few-shot prompting. Our detailed analysis reveals errors at multiple levels, including visual processing, natural language understanding, and pattern generalization.
more » « less
Free, publicly-accessible full text available July 1, 2026
''I Would Have Written My Code Differently'': Beginners Struggle to Understand LLM-Generated Code

https://doi.org/10.1145/3696630.3731663

Zi, Yangtian; Li, Luisa; Guha, Arjun; Anderson, Carolyn Jane; Feldman, Molly Q (July 2025, ACM)

Free, publicly-accessible full text available July 28, 2026
Substance Beats Style: Why Beginning Students Fail to Code with LLMs

https://doi.org/10.18653/v1/2025.naacl-long.433

Lucchetti, Francesca; Wu, Zixuan; Guha, Arjun; Feldman, Molly Q; Anderson, Carolyn Jane (April 2025, Association for Computational Linguistics)

Although LLMs are increasing the productivity of professional programmers, existing work shows that beginners struggle to prompt LLMs to solve text-to-code tasks (Nguyen et al., 2024; Prather et al., 2024; Mordechai et al., 2024). Why is this the case? This paper explores two competing hypotheses about the cause of student-LLM miscommunication: (1) students simply lack the technical vocabulary needed to write good prompts, and (2) students do not understand the extent of information that LLMs need to solve code generation tasks. We study (1) with a causal intervention experiment on technical vocabulary and (2) by analyzing graphs that abstract how students edit prompts and the different failures that they encounter. We find that substance beats style: a poor grasp of technical vocabulary is merely correlated with prompt failure; that the information content of prompts predicts success; that students get stuck making trivial edits; and more. Our findings have implications for the use of LLMs in programming education, and for efforts to make computing more accessible with LLMs.
more » « less
Free, publicly-accessible full text available April 29, 2026
Non-Expert Programmers in the Generative AI Future

https://doi.org/10.1145/3663384.3663393

Feldman, Molly Q; Anderson, Carolyn Jane (June 2024, ACM)

Generative AI is rapidly transforming the practice of programming. At the same time, our understanding of who writes programs, for what purposes, and how they program, has been evolving. By facilitating natural-language-to-code interactions, large language models for code have the potential to open up programming work to a broader range of workers. While existing work finds productivity benefits for expert programmers, interactions with non-experts are less well-studied. In this paper, we consider the future of programming for non-experts through a controlled study of 67 non-programmers. Our study reveals multiple barriers to effective use of large language models of code for non-experts, including several aspects of technical communication. Comparing our results to a prior study of beginning programmers illuminates the ways in which a traditional introductory programming class does and does not equip students to effectively work with generative AI. Drawing on our empirical findings, we lay out a vision for how to empower non-expert programmers to leverage generative AI for a more equitable future of programming.
more » « less
Full Text Available
Knowledge Transfer from High-Resource to Low-Resource Programming Languages for Code LLMs

https://doi.org/10.1145/3689735

Cassano, Federico; Gouwar, John; Lucchetti, Francesca; Schlesinger, Claire; Freeman, Anders; Anderson, Carolyn Jane; Feldman, Molly Q; Greenberg, Michael; Jangda, Abhinav; Guha, Arjun (October 2024, Proceedings of the ACM on Programming Languages)

Over the past few years, Large Language Models of Code (Code LLMs) have started to have a significant impact on programming practice. Code LLMs are also emerging as building blocks for research in programming languages and software engineering. However, the quality of code produced by a Code LLM varies significantly by programming language. Code LLMs produce impressive results on high-resource programming languages that are well represented in their training data (e.g., Java, Python, or JavaScript), but struggle with low-resource languages that have limited training data available (e.g., OCaml, Racket, and several others). This paper presents an effective approach for boosting the performance of Code LLMs on low-resource languages using semi-synthetic data. Our approach, called MultiPL-T, generates high-quality datasets for low-resource languages, which can then be used to fine-tune any pretrained Code LLM. MultiPL-T translates training data from high-resource languages into training data for low-resource languages in the following way. 1) We use a Code LLM to synthesize unit tests for commented code from a high-resource source language, filtering out faulty tests and code with low test coverage. 2) We use a Code LLM to translate the code from the high-resource source language to a target low-resource language. This gives us a corpus of candidate training data in the target language, but many of these translations are wrong. 3) We use a lightweight compiler to compile the test cases generated in (1) from the source language to the target language, which allows us to filter our obviously wrong translations. The result is a training corpus in the target low-resource language where all items have been validated with test cases. We apply this approach to generate tens of thousands of new, validated training items for five low-resource languages: Julia, Lua, OCaml, R, and Racket, using Python as the source high-resource language. Furthermore, we use an open Code LLM (StarCoderBase) with open training data (The Stack), which allows us to decontaminate benchmarks, train models without violating licenses, and run experiments that could not otherwise be done. Using datasets generated with MultiPL-T, we present fine-tuned versions of StarCoderBase and Code Llama for Julia, Lua, OCaml, R, and Racket that outperform other fine-tunes of these base models on the natural language to code task. We also present Racket fine-tunes for two very recent models, DeepSeek Coder and StarCoder2, to show that MultiPL-T continues to outperform other fine-tuning approaches for low-resource languages. The MultiPL-T approach is easy to apply to new languages, and is significantly more efficient and effective than alternatives such as training longer.
more » « less
Full Text Available
How Beginning Programmers and Code LLMs (Mis)read Each Other

https://doi.org/10.1145/3613904.3642706

Nguyen, Sydney; Babe, Hannah McLean; Zi, Yangtian; Guha, Arjun; Anderson, Carolyn Jane; Feldman, Molly Q (May 2024, ACM)

Generative AI models, specifically large language models (LLMs), have made strides towards the long-standing goal of text-to-code generation. This progress has invited numerous studies of user interaction. However, less is known about the struggles and strategies of non-experts, for whom each step of the text-to-code problem presents challenges: describing their intent in natural language, evaluating the correctness of generated code, and editing prompts when the generated code is incorrect. This paper presents a large-scale controlled study of how 120 beginning coders across three academic institutions approach writing and editing prompts. A novel experimental design allows us to target specific steps in the text-to-code process and reveals that beginners struggle with writ- ing and editing prompts, even for problems at their skill level and when correctness is automatically determined. Our mixed-methods evaluation provides insight into student processes and perceptions with key implications for non-expert Code LLM use within and outside of education.
more » « less
Full Text Available
Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions

Cassano, Federico; Li, Luisa; Sethi, Akul; Shinn, Noah; Brennan-Jones, Abby; Ginesin, Jacob; Berman, Edward; Chakhnashvili, George; Lozhkov, Anton; Anderson, Carolyn Jane; et al (July 2024, First Conference on Language Modeling)

A significant amount of research is focused on developing and evaluating large language models for a variety of code synthesis tasks. These include synthesizing code from natural language, synthesizing tests from code, and synthesizing explanations of code. In contrast, the behavior of instructional code editing with LLMs is understudied. These are tasks in which the model is provided a block of code and an instruction to modify the code. The editing instruction may ask for a feature to be added or removed, describe a bug and ask for a fix, or ask for a different kind of solution. We introduce a carefully crafted benchmark of code editing tasks and use it to evaluate several cutting edge LLMs. Our evaluation exposes a significant gap between the capabilities of state-of-the-art open and closed models. For example, even GPT-3.5-Turbo is better than the best open model at code editing tasks. We also introduce a new, carefully curated, permissively licensed training dataset of code editing tasks coupled with natural language instructions. Using this training dataset, we show that we can fine-tune open Code LLMs to significantly improve their code editing capabilities, closing the gap between open and closed models. All code, data, and models are available at https://github.com/nuprl/CanItEdit.
more » « less
Full Text Available
Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions

Cassano, Federico; Li, Luisa; Sethi, Akul; Shinn, Noah; Brennan-Jones, Abby; Ginesin, Jacob; Berman, Edward; Chakhnashvili, George; Lozhkov, Anton; Anderson, Carolyn Jane; et al (July 2024, Conference on Language Modeling)

A significant amount of research is focused on developing and evaluating large language models for a variety of code synthesis tasks. These include synthesizing code from natural language, synthesizing tests from code, and synthesizing explanations of code. In contrast, the behavior of instructional code editing with LLMs is understudied. These are tasks in which the model is provided a block of code and an instruction to modify the code. The editing instruction may ask for a feature to be added or removed, describe a bug and ask for a fix, or ask for a different kind of solution. We introduce a carefully crafted benchmark of code editing tasks and use it to evaluate several cutting edge LLMs. Our evaluation exposes a significant gap between the capabilities of state-of-the-art open and closed models. For example, even GPT-3.5-Turbo is better than the best open model at code editing tasks. We also introduce a new, carefully curated, permissively licensed training dataset of code editing tasks coupled with natural language instructions. Using this training dataset, we show that we can fine-tune open Code LLMs to significantly improve their code editing capabilities, closing the gap between open and closed models. All code, data, and models are available at https://github.com/nuprl/CanItEdit.
more » « less
Full Text Available
StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code

https://doi.org/10.18653/v1/2024.findings-acl.501

Babe, Hannah; Nguyen, Sydney; Zi, Yangtian; Guha, Arjun; Feldman, Molly; Anderson, Carolyn (January 2024, Association for Computational Linguistics)

Code LLMs have the potential to make it easier for non-experts to understand and write code. However, current CodeLLM benchmarks rely on a single expert-written prompt per problem, making it hard to generalize their success to non-expert users. In this paper, we present a new natural-language-to-code benchmark of prompts written by a key population of non-experts: beginning programmers. StudentEval contains 1,749 prompts written by 80 students who have only completed one introductory Python course. StudentEval contains numerous non-expert prompts describing the same problem, enabling exploration of key factors in prompt success. We use StudentEval to evaluate 12 Code LLMs and find that StudentEval is a better discriminator of model performance than existing benchmarks. Our analysis of student prompting strategies reveals that nondeterministic LLM sampling can mislead students about the quality of their descriptions, a finding with key implications for Code LLMs in education.
more » « less
Full Text Available
MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation

https://doi.org/10.1109/TSE.2023.3267446

Cassano, Federico; Gouwar, John; Nguyen, Daniel; Nguyen, Sydney; Phipps-Costin, Luna; Pinckney, Donald; Yee, Ming-Ho; Zi, Yangtian; Anderson, Carolyn Jane; Feldman, Molly Q; et al (April 2023, IEEE Transactions on Software Engineering)
Michael Pradel (Ed.)
Large language models have demonstrated the ability to generate both natural language and programming language text. Although contemporary code generation models are trained on corpora with several programming languages, they are tested using benchmarks that are typically monolingual. The most widely used code generation benchmarks only target Python, so there is little quantitative evidence of how code generation models perform on other programming languages. We propose MultiPL-E, a system for translating unit test-driven code generation benchmarks to new languages. We create the first massively multilingual code generation benchmark by using MultiPL-E to translate two popular Python code generation benchmarks to 18 additional programming languages. We use MultiPL-E to extend the HumanEval benchmark and MBPP benchmark to 18 languages that encompass a range of programming paradigms and popularity. Using these new parallel benchmarks, we evaluate the multi-language performance of three state-of-the-art code generation models: Codex, CodeGen and InCoder. We find that Codex matches or even exceeds its performance on Python for several other languages. The range of programming languages represented in MultiPL-E allow us to explore the impact of language frequency and language features on model performance. Finally, the MultiPL-E approach of compiling code generation benchmarks to new programming languages is both scalable and extensible, making it straightforward to evaluate new models, benchmarks, and languages.
more » « less
Full Text Available

« Prev Next »

Search for: All records